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Abstract 

Recent studies have consistently inferred high rates of adaptive molecular 
evolution between Drosophila species. At the same time, the Drosophila genome 
evolves under different rates of recombination, which results in partial genetic 
linkage between alleles at neighboring genomic loci. Here we analyze how 
linkage correlations affect adaptive evolution. We develop a new inference 
method for adaptation that takes into account the effect on an allele at a focal site 
caused by neighboring deleterious alleles (background selection) and by 
neighboring adaptive substitutions (hitchhiking). Using complete genome 
sequence data and fine-scale recombination maps, we infer a highly 
heterogeneous scenario of adaptation in Drosophila. In high-recombining 
regions, about 50% of all amino acid substitutions are adaptive, together with 
about 20% of all substitutions in proximal intergenic regions. In low- 
recombining regions, only a small fraction of the amino acid substitutions are 
adaptive, while hitchhiking accounts for the majority of these changes. 
Hitchhiking of deleterious alleles generates a substantial collateral cost of 
adaptation, leading to a fitness decline of about 30/2N per gene and per million 
years in the lowest-recombining regions. Our results show how recombination 
shapes rate and efficacy of the adaptive dynamics in eukaryotic genomes. 

Author Summary 

Because recombination takes place at a limited rate, alleles at neighboring sites 
in a genome can remain genetically linked over evolutionary periods. In this 
paper, we show that evolutionary forces generated by genetic linkage have 
drastic consequences for the adaptive dynamics in low-recombining parts of the 
Drosophila genome. Our study is based on a new method to analyze allele 
frequencies that is applicable to genome data at both high and low rates of 
recombination. We show that genes in low-recombining regions of the 
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Drosophila genome incur a substantial cost of adaptation, because deleterious 
alleles get fixed more frequently than under high recombination. This cost 
reduces rate and power of the adaptive process. Our results suggest that the 
Drosophila genome has evolved to minimize this cost by placing genes under 
high adaptive pressure in high-recombining regions. 

Introduction 

Genetic linkage imposes evolutionary correlations between neighboring genomic 
loci. Two particular effects are well known: adaptive mutations induce genetic 
hitchhiking of linked neutral and weakly selected variants [1-12], and 
deleterious mutations cause background selection on linked sites[13-20]. Both 
effects reduce sequence diversity, but in different ways. Background selection 
caused by strongly deleterious mutations leads to an unbiased removal of 
genetic diversity, which can be described by a reduced effective population 
size [17, 18]. Genetic hitchhiking in selective sweeps affects common variants in a 
stronger way than rare variants and, hence, distorts the shape of the allele 
frequency spectrum [3,10,21-23]. Both effects depend on the rate of 
recombination, and are expected to be strong evolutionary forces in low- 
recombining regions of the Drosophila genome. 

In this paper, we integrate hitchhiking and background selection into a new 
method to infer rates and the genomic distribution of adaptation under (partial] 
genetic linkage. Our model is based on diffusion theory [24,25]. Its key new 
variable is the effective rate of linked selective sweeps, which governs the 
hitchhiking rate of neutral and weakly selected polymorphisms and can be 
estimated directly from sequence data. We model linked recurrent sweeps as a 
Poisson process, which is often implicit in other approaches [8,10,26,27]. Our 
method exploits the entire allele frequency spectrum, rather than the level of 
diversity alone. 

In Drosophila melanogaster, studying linkage effects has a long history. These 
studies are motivated by a strong correlation between the recombination rate 
and observed levels of diversity [28-30], which has been attributed to 
background selection and hitchhiking [31-36]. Our integrated inference of 
hitchhiking, background selection and adaptive evolution uses data from the 
Drosophila melanogaster genetic reference panel (DGRP) [29], which consists of 
168 complete genome sequences from inbred lines sampled in North Carolina, 
USA. We also use the recently published high-resolution recombination map by 
Comeron and coworkers [30] to analyze polymorphism spectra as a function of 
the recombination rate. 

We show that linkage correlations affect the adaptive process of the Drosophila 
genome in two ways. Background selection explains the broad reduction of 
genetic diversity with decreasing recombination rate, which is consistent with 
previous findings [28-30]. In addition, the low-recombining regions (which 
account for 21% of autosomal sequence] are marked by strongly linked selective 
sweeps that generate substantial hitchhiking and distort allele frequency 
spectra. Our integrated inference method leads to estimates of adaptive rates 
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also in these low-recombining regions, which had to be excluded from most 
previous studies due to the confounding effects of linkage[24,26,37,38]. We find 
a sharp drop in the rate of adaptation, compared to high-recombining regions. In 
addition, we estimate rate and effect of deleterious fixations due to hitchhiking, 
which quantify the cost of adaptation imposed by genetic linkage. 

Results 

Probabilistic evolution of neutral sequence under genetic hitchhiking and 
background selection 

Figure 1 illustrates the evolutionary models used for our analysis. The full model 
of linked adaptation describes the evolution of a focal genomic site, which is 
coupled by background selection to neighboring deleterious variants and by 
hitchhiking to neighboring beneficial variants sweeping through the population 
(Figure la]. We describe both types of linkage interactions by summary 
parameters, which enter an effective single-site model for the focal genomic 
position. Specifically, background selection caused by deleterious mutations 
reduces the genetic diversity at the focal site but leaves the shape of the allele 
frequency spectrum invariant; this effect can be described by a reduction in 
effective population size [13,17]. Hitchhiking in selective sweeps reduces the 
diversity and changes the allele frequency spectrum. We capture the effects of 
hitchhiking by an effective rate of linked selective sweeps. This parameter 
measures sweeps that are close enough to the focal site to affect its alleles by 
hitchhiking. Together, our full model has four parameters: the scaled mutation 
rate 6, the scaled divergence time t , the scaled effective population size A, and 
the effective rate of linked sweeps, v [Materials and Methods and Supplementary 
Text). To quantify the importance of both kinds of linkage interactions in 
different parts of the Drosophila genome, we compare our inference from the full 
linked adaptation model to results from two partial models. First, the 
background selection model retains the coupling of the focal site with 
neighboring deleterious variants but neglects hitchhiking (Figure lb]. This 
model has the three independent parameters 0, t, X and the constraint v = 0. 
Second, the unlinked adaptation model assumes that recombination is strong 
enough to annihilate all evolutionary effects of genetic linkage (Figure lc]. This 
model has the two independent parameters 6, t and the constraint v = 0, A = 1 . 

For all three models, we derive analytic probability distributions for the allele 
frequencies at the focal site (Supplementary Text). These frequency spectra 
distinguish the full linked adaptation model from the background selection 
model and the unlinked adaptation model, which is similar to the model 
introduced by Mustonen and Lassig [24] (Figure Id]. Specifically, a positive rate 
of linked sweeps, v, removes common variants relative to rare variants, which is 
correctly captured by our model. A qualitatively similar depletion of the 
frequency spectrum was observed in rapidly evolving populations involving 
multiple segregating beneficial mutations [23]. The three models also have 
distinct effects on summary statistics of allele frequencies such as the diversity 
and Tajima's D, which measures the relative abundance of intermediate- 
frequency polymorphisms compared to low- and high-frequency variants [39]. 
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The background selection model lowers diversity, but leaves Tajima's D 
invariant. In contrast, the full linked adaptation model lowers diversity and 
generates negative values for Tajima's D, indicating a depletion of common 
variants (Figure le). 

Our approach to model background selection as a simple reduction of the 
effective population size is valid if linked deleterious mutations are under 
sufficiently strong negative selection. For background selection caused by 
weakly deleterious mutations, it has been shown that the effect on the spectrum 
is more complicated than a simple reduction in effective population size 
[13,14,19,20]. For the data set of this study, our simple model captures the 
dominant effect of background selection, which is an unbiased removal of 
diversity with decreasing recombination. 

Using the analytical probability distributions shown in Figure Id, we develop an 
inference framework to estimate the effective parameters 6, X and v and various 
other evolutionary characteristics under the different models. As data input for 
this inference, we use divergence data between D. melanogaster and D. simulans, 
together with outgroup-directed allele frequency distributions in the D. 
melanogaster population (these spectra count the number of alleles that are 
different from the corresponding allele in the outgroup species D. simulans). To 
test our inference framework, we use the linked adaptation model to simulate 
allele frequency data for varying values of v. Our inference recovers all three 
parameters 0, A and v with good accuracy (Figure Id and Supplementary Figure 
2). For later purposes, we also test whether we can infer directional selection on 
the focal site itself (Supplementary Figure S2a-c), with similarly accurate results. 

Quantifying linkage effects from synonymous sites in Drosophila 

We study the autosomal genome sequences of 168 inbred lines of Drosophila 
melanogaster, published in the Drosophila melanogaster Genetic Reference Panel 
(DGRP)[29]. We use the Drosophila simulans reference genome to orient allele 
frequencies of segregating sites and to determine fixed differences. We first 
consider all synonymous sites, binned according to the local recombination rate 
(see Material and Methods and Supplementary Table SI). Summary statistics of 
this binning clearly show a strong dependency on the recombination rate (Figure 
2). First, we observe a moderate increase in the rate of divergence from 
Drosophila simulans (Figure 2a) with decreasing recombination rate. Second, the 
diversity decreases sharply by about a factor of 8 as the recombination rate 
decreases, which has been reported before [28,29] (Figure 2b). Finally, for low 
recombination, the allele frequency spectrum deviates from the standard 
spectrum. This can be seen in the growing difference between two estimators of 
neutral diversity (Figure 2b) and further quantified by Tajima's D [39], which 
drops below -1.0 for zero recombining regions (Figure 2c). While a drop in 
diversity alone can be explained by background selection from linked deleterious 
mutations alone, a drop in Tajima's D is in this case best explained by hitchhiking 
with linked beneficial mutations. 

To quantify further the change in the allele frequency data with the 
recombination rate, we apply our probabilistic model to each recombination bin 
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separately, estimating the population size A, the mutation rate 0 and the effective 
rate of linked drivers v. We report maximum likelihood estimates for these 
parameters for all recombination bins (Figure 3a-c). With decreasing 
recombination rate we observe clines in the parameters, which mirror the clines 
in the summary statistics (Figure 2]. First, the mutation rate increases mildly 
with decreasing recombination rate, which matches the observed increasing 
divergence. This is very similar for the background selection model and the 
linked adaptation model, with slightly higher estimates from the former (Figure 
3a]. Reflecting the strong drop in diversity (Figure 2b], we estimate a drop in 
effective population size for both models (Figure 3b], with a higher population 
size in the linked adaptation model (A = 0.25 versus A = 0.17], because it 
explains part of the reduction in polymorphisms by hitchhiking. Finally, the 
linked adaptation model estimates substantial levels of the rate of linked sweeps 
(v~10] for recombination rates below about 0.4 cM/Mb (0.4 x 10 -8 crossovers 
per nucleotide per generation] (Figure 3c]. This rate of linked sweeps means that 
every site in this region is linked to about one sweep per 80,000 years (assuming 
an effective haploid population size of N 0 = 4 x 10 6 [40], see Supplementary 
Text], which is 10 times higher than the time needed for neutral variants to fix by 
drift. 

The strong dependency of our parameter estimates on the recombination bin is 
clearly visible in the allele frequency spectra (Figure 3d]. First, for the bin with 
highest recombination, the unlinked adaptation model (gray curve] fits the data 
very well, consistent with no strong linkage effects in that bin. The spectrum 
from the lowest recombination bin has a reduced level of polymorphisms, which 
is well explained by the drop in effective population size, captured by both the 
background selection and the linked adaptation model. However, this spectrum 
also exhibits a substantial V-shaped deviation from the background selection 
model prediction (blue curve], which is well captured by the linked adaptation 
model (red curve]. The difference between the two models is also reflected in the 
log-likelihood score (Supplementary Figure S6], which is much larger for the 
linked adaptation model in regions of recombination rates below 0.4 cM/Mb. 

We note that our model does not contain demographic effects, such as 
bottlenecks, which have been used previously to explain the synonymous site 
frequency spectrum in Drosophila [41]. Indeed, demographic effects on the site 
frequency spectrum are negligible for this data set, given the good fits of the 
simple unlinked adaptation model to highly recombining sites. This does not rule 
out that demography affects other observables, in particular haplotype structure 
(see also the Discussion below]. We also tested whether the deviations from the 
neutral spectrum in low recombining regions could be caused by directional 
selection on synonymous sites. We find that in order to explain the depletion of 
intermediate frequency polymorphisms by selection, unrealistically high 
selection coefficients on synonymous sites are necessary (Supplementary Figure 
S3], clearly inconsistent with previous observations [42]. 

Mixed selection model for nonsynonymous and non-coding sites 

We use a mixed model for polymorphism spectra and divergence in non- 
synonymous and non-coding annotation categories that consists of four 
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components (Supplementary Text]: neutral sites, weakly or moderately selected 
sites that contribute to rare variants, sites with adaptive substitutions (seen as 
fixed differences between the two species], and conserved sites under purifying 
selection. The component of weakly selected sites is similar to the neutral 
component with one additional scaled parameter a, which denotes the selection 
coefficient (Supplementary Text and Supplementary Figure S2d]. Together with 
the three neutral parameters, the full mixed model has 7 parameters: three 
weights to parameterize the contributions of the four components (the fourth is 
set by normalization], the scaled selection coefficient a, and the three neutral 
parameters 6, A and v introduced above. We estimate these parameters in a 
hierarchical way: First, we fit the neutral parameters 9, A and v based on the 
synonymous sites in each bin, as already presented above. Second, we obtain 
maximum likelihood estimates of the other four parameters, keeping the neutral 
parameters fixed. To assess the impact of hitchhiking in particular, we compare 
our estimates of this full model to a background selection mixed model, with the 
constraint v = 0, and an unlinked mixed model, with A = 1, v = 0. 

We divide the genome into four further broad annotation categories beyond 
synonymous sites: Intergenic regions, Introns, untranslated regions in exons 
(UTR] and nonsynonymous sites (see Material and Methods). For each 
annotation category we bin all sites according to the same recombination rate 
bins as for synonymous sites and then perform the conditional model 
estimations as described above for each bin separately. Figure 4 shows the allele 
frequency spectrum and the model predictions for the first and the last 
recombination bin for each annotation category, divided by the background 
selection model spectrum as estimated from synonymous sites. This serves to 
highlight the distortions from the standard spectrum: In both high and low 
recombination, we see an overall reduced diversity due to selection, and for low 
recombination we additionally see a relative enrichment of rare variants and 
depletion of common variants. As can be seen, for high recombination (upper 
plots], the unlinked mixed model fits the data well, consistent with the results 
from synonymous sites. For low recombination (lower plots], the background 
selection mixed model is a poor explanation for the data, which clearly exhibits 
the V-shaped distortion observed previously. Here, the linked adaptation mixed 
model fits are substantially better. This is satisfying because the background 
selection mixed model and full mixed models have the same number of free 
parameters (four], since 6, A and v are fixed from synonymous sites, so their 
performance is directly comparable. 

All model parameter estimates with bootstrap error estimates are summarized 
in Supplementary Table S2. We find that the linked adaptation mixed model 
performs substantially better than the background selection mixed model for 
recombination values below 0.4 cM/Mb, with about 16,000 units of log 
likelihood difference for the lowest four recombination bins in total, which is 
highly significant (see Supplementary Figure S6]. 

The evolutionary cause of fixed differences 

Because our full mixed model allows estimation of neutral, weakly selected and 
adaptive fractions of sites, we can specifically estimate how these fractions 
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contribute to fixed differences between the two species. These estimates are 
shown in summary for the full mixed model in Figure 5a (see Supplementary 
Figure S5a for a split into annotation categories and estimates from the 
background selection mixed model], using a more coarse-grained binning for 
clarity (Methods]. As can be seen, below 0.4 cM/Mb most mutations are 
hitchhiking with selective sweeps (dark blue]. This includes neutral and weakly 
selected variants that have reached some frequency by drift and have then been 
picked up by a linked sweep. Overall, the fraction of adaptive substitutions is 
between 10% and 20%. Figure 5b and Supplementary Figures S5b show how 
this fraction of adaptive substitutions is distributed across recombination values 
and annotation classes. We find that in high recombining regions, between 44% 
and 59% of nonsynonymous substitutions are adaptive, confirming previous 
observations in Drosophila [24,29]. The second highest fraction of adaptive 
substitutions in highly recombining regions is seen in proximal intergenic 
regions (20%-25%] and in UTR (15%-20%], consistent with adaptively evolving 
regulatory regions in the untranslated parts of exons. In intergenic regions 
further away from genes and in Introns, we observe only low fractions of 
adaptive substitutions (around and below 15%]. When we look at low- 
recombining regions, substitutions in nonsynonymous, UTR and proximal 
intergenic sequence have only a small adaptive component. In the arguably less 
functional categories like distal intergenics and introns, we see a more erratic 
pattern with an actual increase up to 30% of adaptive substitutions in zero 
recombining regions (Supplementary Figure S5b]. While this signal may indicate 
increased non-coding adaptation in low recombining regions, it may also be 
caused by technical artifacts. In particular, alignment errors between D. simulans 
and D. melanogaster in non-coding sequence can cause increased rates of 
sequence mismatches between the two species and generate a spurious signal of 
adaptation. 

Overall, we observe a consistently smaller fraction of adaptive substitutions for 
low recombination with the full linked adaptation mixed model than with the 
background selection mixed model (Supplementary Figure S5a]. This is expected 
since the former provides a much better fit to the neutral component (Figure 3d] 
than the latter, which allows the full mixed model to explain a larger portion of 
the spectrum using neutral sites. Only the excess of substitutions not explained 
by neutral substitutions are interpreted as being adaptive. 

We compared our estimates of the fraction of adaptive substitutions with the 
generalized McDonald-Kreitman (MK] test [43], which can be corrected for the 
presence of weakly selected sites [29,37]. As shown in Supplementary Figure S4, 
this method is much more conservative than ours, in particular in low 
recombining regions, where the MK estimate for the adaptive fraction is 
estimated to be zero for all annotation categories. This is consistent with an 
observation made by Messer and Petrov [44], and it highlights the importance of 
explicitly using the allele frequency spectrum to estimate parameters such as the 
adaptive fraction of substitutions. The only case for which the test and our 
method agree quantitatively are nonsynonymous substitutions in high 
recombining regions, which is the case the test was originally developed for [43]. 
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Fitness Flux and the selective effect of adaptive substitutions 

Fitness flux measures the speed of adaptation; according to the fitness flux 
theorem, fitness flux is generically positive [45]. In an adaptive process driven by 
the substitution of beneficial mutations, the fitness flux is simply the product of 
the substitution rate and the average selection coefficient of these changes 
[24,45]. In our model, we do not directly infer a selection coefficient of adaptive 
mutations, because any such direct inference would have a high degree of 
uncertainty. However, we can derive an upper bound on the strength of 
adaptation from its hitchhiking effects on synonymous changes. As shown in 
Supplementary Text, the fitness flux 3> is simply related to the rate v of linked 
sweeps and the recombination rate r, 

$ = Cvr 

with a proportionality constant C that is greater than but of order one. Since our 
method cannot infer arbitrarily low levels of hitchhiking (in fact we find v = 0 
for most recombination bins], we set v = 1 as a detection threshold and use the 
above equation to estimate an upper bound on the fitness flux based on that 
threshold. For nonsynonymous sites, Figure 5c shows this upper bound across 
the genome, except for zero-recombining regions (the lowest bin]. Most 
estimates are between 10 and 100 in units of n/2N 0 per site, which is consistent 
with previous estimates in Drosophila [24,46]. We then use this upper bound on 
the fitness flux together with the inferred rate of adaptive nonsynonymous 
substitutions to estimate an upper bound on their selection coefficient 
(Supplementary Text]. We find that in the part of the genome with the lower 
50% of recombination rates, this upper bound on the selection coefficient is 
about s a < 1 x 10 -4 , and about five times higher in regions with higher 
recombination, i.e. s a < 5 x 10 -4 . Previous work in Drosophila has led to 
estimates across four orders of magnitude, s a = 10 -5 [38], s a = 10 -4 [24], 
s a = 10 -3 [27,47], ands a = 10 -2 [26] (see also [40] for a partial summary], 
some of which exceed our estimated upper bound by a factor 100. 

The cost of adaptation in low-recombining regions 

We can use our model to estimate the cost of adaptation imposed by genetic 
linkage. This cost is given by a negative component of the fitness flux, <J>_, which 
is the product of the rate of deleterious substitutions, which mainly fix via 
hitchhiking, and their average selection coefficient, which is negative 
(Supplementary Text]. Figure 5c shows an estimate of the resulting hitchhiking 
flux <J>_ for nonsynonymous sites in different recombination bins. In the lowest- 
recombining regions of the Drosophila genome, we find that <5_ reaches values of 
about 1.6 fi/2N 0 per sequence site, or about 30/2W 0 per million years per gene 
(Supplementary Text]. This cost reaches about 5% of the upper bound on the 
total fitness flux in the second-lowest recombination bin (Figure 5c]. 

A related cost measure is the contribution of deleterious hitchhiking to genetic 
load [48-52]. Our mixed model predicts the stationary probability of any site to 
be fixed in a low- fitness allele (Supplementary Text]. Multiplying this probability 
with the single-site selection coefficient, we obtain a genetic load of about 
40/2N 0 per gene. This load measures the fitness cost of placing an average gene 
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into a region of low recombination; its evolutionary interpretation is discussed 
below. 

Discussion 

In this study, we have developed an analytic model for adaptive evolution under 
partial genetic linkage. This model maps the complex process of correlated 
multi-site evolution onto an effective single-site process with three evolutionary 
forces: positive selection causing primary adaptation, genetic draft inducing 
hitchhiking, and background selection constraining diversity and divergence. 
Despite its simplicity, our model explains allele frequency data across different 
recombination classes of the Drosophila genome with remarkable accuracy 
(Figures 3d and 4). Because our inference method does not use haplotypes, it can 
be applied to bulk sequencing data, which extends its possible range of 
applications. 

Consistent with previous studies, we infer high rates of adaptive evolution in 
high-recombining sequence of the Drosophila genome: about 50% of the 
nonsynonymous substitutions in coding sequence, and 20% of the substitutions 
in UTR and in proximal intergenic sequence are adaptive. We obtain upper 
bounds for the resulting speed of adaptation, which is measured by a fitness flux 
of order 100 fi/2N 0 per sequence site, and for the average selection coefficient of 
adaptive changes, which is of order 10 4 . These bounds follow from a simple 
argument: adaptive processes with higher total fitness flux would distort the 
frequency spectrum of synonymous polymorphisms, which we do not observe in 
the high-recombining regions spanning 80% of the Drosophila genome. This 
argument constrains the average speed of adaptation by hard selective sweeps, 
which lead to substitutions of the beneficial allele and drive the long-term 
adaptive divergence between species. It does not exclude individual selective 
sweeps with far higher selection coefficients. It also does not constrain soft and 
partial sweeps, which involve beneficial alleles arising on diverse genetic 
backgrounds or alleles with a conditional selective advantage [53]. These sweeps 
leave a weaker trace in the synonymous frequency spectrum that hard sweeps. 
Soft sweeps have been inferred by haplotype-based genomic scans for 
adaptation in several systems including Drosophila [54]. 

Remarkably, the allele frequency spectra of the North Carolina flies lack a clear 
footprint of the population's recent demography. About 80% of synonymous 
sites in the autosomal genome show a textbook neutral spectrum with constant 
effective population size. The spectrum in the 20% lowest-recombining 
sequence sites is depleted of common variants, but we attribute this 
recombination class-specific signal to hitchhiking rather than to recent changes 
of population size. We emphasize that our analysis is based on site frequency 
spectra, so this result does not rule out demography being visible in some other 
(e.g., haplotype-based] observables. In other systems, for example in humans, 
demographic effects are more prevalent in the allele frequency data, and our 
method will have to be extended to distinguish them from signals of adaptation 
and of linkage correlations. Similarly, we can extend our method to account for a 
variable density of functional elements, say gene content. This is expected to 
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generate heterogeneous amounts of adaptation, hitchhiking, and background 
selection within one recombination class. 

The most striking result of this study is a strong quantitative relation between 
the amount of adaptation and linkage correlations in the Drosophila genome, 
which is summarized in Figure 5. The fraction of adaptive amino acid 
substitutions drops from about 50% in high-recombining regions to small values 
in the 20% lowest-recombining sites; a similar drop is observed in UTR and 
proximal intergenic regions. The majority of substitutions in all of these 
sequence classes can be accounted for by hitchhiking. We also have shown that 
hitchhiking imposes a substantial cost on adaptation, which is measured by a 
negative fitness flux component <t>_ of about 30/2iV 0 per million years per gene 
and a genetic load of about 40/2iV 0 per gene. Together, we obtain a complex 
picture of adaptation in low-recombining regions: linkage interactions reduce 
rate and power of primary selective sweeps by hitchhiking. In a continual 
adaptive process, the fitness cost of hitchhiking is compensated by a cascade of 
secondary adaptive changes at the hitchhiking sites. This complexity of the 
genomic dynamics of adaptation is a generic consequence of linkage interactions, 
which become a strong evolutionary force under low recombination [55-58]. We 
have shown that the interplay of adaptation and linkage interactions already 
generates strong effects in Drosophila, a species with overall high recombination 
rates. These effects are expected to be even stronger in other species with lower 
recombination rates that result, for example, from alternating sexual and asexual 
reproductive modes. The salient point about Drosophila is that recombination 
rates and, hence, the strength of linkage correlations vary strongly within its 
genome, with a broad decrease from central to distal parts of the chromosomes 
[30]. Thus, our findings may suggest that the distribution of genes in the 
Drosophila genome results, in part, from an adaptive minimization of the cost of 
adaptation: genes under high adaptive pressure are predominantly placed in 
high-recombining genomic regions. In this way, the interplay of adaptation and 
genetic linkage can shape the large-scale genome architecture. 



Material and Methods 

Genomic Data 

We downloaded the complete genome sequences of 168 lines from the 
Drosophila Melanogaster Reference Panel (DGRP) from the DGRP website 
(http://dgrp.gnets.ncsu.edu] as fasta files. We downloaded the reference 
sequence from Drosophila simulans, aligned to the reference sequence of 
Drosophila melanogaster. We computed outgroup directed allele frequencies at 
all sites at which a] there is a valid Drosophila simulans allele, b] at least 150 
lines of the DGRP sequences have a called allele (see Supplementary Table SI). 
To simplify downstream analysis, we normalized all sites to 150 called alleles. 
Specifically, if m > 150 alleles are called, and k of those are different from the 
simulans allele, we computed the normalized outgroup directed allele frequency 
(allele count) as 
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k 1 
150 - + - 

m 2. 

Sequence Annotation 

We downloaded gene annotations from flybase [59] and defined annotation 
categories as follows: 

• INTERGENIC FAR: Intergenic regions that are at least 5kb away from 
genes 

• INTERGENIC MEDIUM: Intergenic regions within 5kb distance to the next 
gene, but further away than 500bp 

• INTERGENIC NEAR: Intergenic sites within 500bp of a gene 

• INTRON: Introns on protein-coding genes 

• UTR: untranslated regions on the exons 

• SYNONYMOUS: protein-coding sites on the reference at which none of the 
three possible point mutation changes the encoded amino acid 

• NONSYNONYMOUS: protein-coding sites on the reference at which any of 
the three possible point mutations changes the encoded amino acid. 

In some figures we joined the intergenic categories where appropriate. Most 
genes have multiple associated transcripts due to alternative splicing. We chose 
the transcript corresponding to the longest encoded protein coding sequence for 
each gene and annotated introns, UTRs, synonymous and nonsynonymous sites 
according to this one transcript. See Supplementary Table SI for the number of 
sites in a given annotation category on the different chromosomes. 

Recombination Rate Binning 

Recombination maps were obtained from Comeron et al. [30] through their 
website http://www.recombinome.com, defined as mean rates within lOOkb 
windows. We used the recombination map to annotate every site in the 
Drosophila genome. We then used only sites in the SYNONYMOUS annotation 
category on the autosomal chromosomes (2L, 2R, 3L and 3R) and defined 
quantile boundaries on this set. Specifically, we sorted all recombination rate 
values of this set of sites and determined recombination rate boundaries by 
dividing the data set into 2 1 equally large subsets of values. We then used these 
quantile boundaries to bin all sites (not just those in category SYNONYMOUS) 
into bins according to their local recombination rate. Here are the quantile 
boundaries used in this study for autosomal data (in cM/Mb): 0.0, 0.069, 0.217, 
0.415, 0.44, 0.821, 1.055, 1.29, 1.415, 1.592, 1.741, 1.938, 2.169, 2.354, 2.612, 
2.838, 3.156, 3.461, 3.796, 4.244, 5.395, Infinity. For figure 5, we used a more 
coarse binning merging bins [0], [1, 2], [3, 4, 5], [6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 
16] and [17, 18, 19, 20]. 

For plotting purpose only, we show allele frequency spectra with averaged 
number of counts in neighboring allele frequencies. Specifically, of the 151 allele 
frequency values (incl. 0 and 150), we average values in groups of 2 from 
frequency 10 through frequency 11, and in groups of 3 from frequency 12 
through 44 and from frequency 120 through 149. We average values in groups of 
5 from frequency 45 through 119. 
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Summary of model parameter estimation 

In the Supplementary Text we derive the allele frequency distribution for several 
basic models, which all derive from the full probability P (k; m,T,B, a, v, A) 
to observe in m samples of the ingroup species k alleles which differ from some 
outgroup. The parameters are the scaled time to the common ancestor of the two 
species t = t/2N 0 , the scaled mutation rate 9 = 2N 0 n, the scaled selection 
coefficient a = 2N 0 s, the scaled rate of linked selective sweeps v = 2N 0 V and 
the scaled effective population size A = N/N 0 . The basic models are: 

^unlinked (fr> ™, T, 9) = P(k; 771, T, 9, 0,0,1), 

Pbgs(^; m, t, 9, A) = P(k; m, x, 9, 0,0, A), 
linked (k; m, t, 9, v, A) = P(k; m, t, 9, 0, v, A), 

P selection(k;m,T,0,(7,A) = P(k', 771, T, 9, (J, 0, A). 

These models are used to estimate parameters based on synonymous sites only. 
To model other annotation categories, we introduce mixed models with the 
following components: 

• A neutral component with fraction c n , modeled by one of P U niinked< ^bgs or 
p 

1 linked - 

• A weakly selected component with fraction c w , which is modeled by the 
most general model P, but with a constraint a > 1. 

• A fraction of adaptive substitutions with fraction c a . 

• A fraction of additional conserved sites with fraction c c = (1 — c n — c w — 
c a )- 

Taking the linked adaptation model as the neutral model, the full mixed model is: 

^linked mixed (Hi ^1> T, 9, O, V, A, C n , C w , C a ) 

= c n P Vmkei (k; 771, t, 9, A) + c w P(k; m, t, 9, a, v, A) + c a S kiin + c c S ki0 
Similar mixed models are defined using the background selection or the unlinked 
adaptation model as neutral component, with accordingly fewer neutral 
parameters. 

To estimate parameters, we consider a data set of outgroup-directed allele 
frequencies with a fixed sample size m. We denote the number of sites with allele 
frequency k by n k . The total log-likelihood of the data given parameters 0 is 
then: 

m 

L({n k ; 771, 0) = ^ 7i fe log (P(k; m, 0)) 
fc=o 

where Pisa placeholder for the appropriate model, and 0 denotes the set of 
model parameters. Parameter estimates are obtained by maximization of the log- 
Likelihood: 

0 = argmax£({n fe ;777, 0). 
e 

As detailed in Supplementary Text, we do not simultaneously estimate all 7 
parameters of the mixed model, but use a hierarchical approach, first estimating 
the neutral parameters from synonymous sites only. 

Bootstrapping 

We use bootstrapping to obtain error estimates for all parameters and derived 
estimates (i.e. fractions of substitutions and genetic load). Each bootstrap sample 
is generated from the frequency count data in a given bin, by resampling all 
frequency counts with replacement from the original counts. We obtain a 
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standard error estimate for each parameter by taking the standard deviation of 
that parameter across 20 bootstrap samples. 

Simulations 

We simulated the background selection and linked adaptation models using a 
Monte Carlo method. In the standard simulation, a population with N = 1000 
individuals is simulated at a single site with two alleles. Mutations occur 
randomly with rate fi. Each generation is sampled with replacement from the 
previous generation. We introduce selection by assigning a modified sampling 
weight p = exp (s), where s is the selection coefficient. Linked selective sweeps 
occur with rate V. For each sweep we choose a linked allele with a probability 
equal to its frequency x. A sweep instantaneously fixes that allele, setting x = 1. 

For a single sample, we start with an equilibrated allele frequency as ancestral 
value (obtained by simulating a single site for 2//J. generations] and then 
simulate two separate populations for t generations, starting with the ancestral 
allele frequency. We sample a single outgroup allele and 20 ingroup alleles from 
the two evolved populations, respectively. The result is a single outgroup- 
directed allele frequency k, obtained by counting the number of ingroup alleles 
that are different from the outgroup. We simulate 40,000 independent samples 
before testing parameter inference based on the resulting spectrum. 

Implementation 

The implementation of the inference method is available under 

https : / / github . com/ stschif f /hf it. 
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Figures 

Figure 1: Models of adaptation under linkage 

(a] Linked adaptation model: The evolution of a neutral or weakly selected focal 
site (black] involves linkage-generated interactions with neighboring beneficial 
mutations (red] and deleterious mutations (blue]. We describe these 
interactions by two effective model components: deleterious mutations lower 
the effective population size via background selection (A < 1], beneficial 
mutations generate an effective rate of linked sweeps (v > 0.] We compare this 
model with two partial models: (b] Background selection model: We include 
linkage interactions only with neighboring deleterious mutations and disregard 
hitchhiking (i.e., v = 0.] (c] Unlinked adaptation model: In this single-site model, 
the focal site evolves independently of its genomic neighborhood (i.e., 
A = 1, v = 0.] (d] Frequency distribution of single-nucleotide polymorphisms at 
the focal site for the linked adaptation model (red], the background selection 
model (blue], and the unlinked adaptation model (gray]. Analytical spectra 
given by our model are compared to simulations for a Wright-Fisher population 
(see Material and Methods), (e] Linkage effects on the polymorphism spectrum 
can be observed in the sequence diversity, n, and in Tajimas D, which measures 
the depletion of intermediate-frequency polymorphisms. 
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Figure 2: Summary statistics of synonymous sites 

This figure shows summary statistics of the polymorphism and divergence data 
from synonymous sites in Drosophila, as a function of the recombination rate. 
The divergence (a) stays roughly constant, with a mild increase with decreasing 
recombination rate. The diversity (b) drops sharply as the recombination rate 
goes to zero, which is seen in both standard estimators B s and 9 n (see 
Supplementary Text], c] Tajima's D, a measure of the distortion of the allele 
frequency spectrum, is the normalized difference between the two standard 
estimators in (b]. It becomes substantially negative for decreasing 
recombination, indicating hitchhiking. 
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Figure 3: Neutral model parameters from synonymous sites 
This figure shows the estimated model parameters for the linked adaptation and 
the background selection model from synonymous sites. Error bars are obtained 
by bootstrapping from the data, as detailed in Methods. With decreasing 
recombination rate, the mutation rate (a] increases, and the effective population 
size (b) decreases, but less so for the linked adaptation model (red]. The effective 
rate of linked sweeps (c) increases sharply in regions of p < OAcM/Mb. (d) 
Synonymous sites allele frequency spectra in low and high recombining regions 
in Drosophila melanogaster differ substantially: High recombining regions 
(empty circles), very closely match the expected neutral allele frequency 
spectrum without linkage (gray line). In low recombining regions (filled circles), 
the linked models (blue and red) both capture the decrease in the level of 
polymorphisms, but only the linked adaptation model (red) also captures the 
observed distortion. In (d) data points are averaged over neighboring frequency 
values for plotting purpose only (see Methods). 
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Figure 4: Nonsynonymous and Noncoding model fits 

Shown are the site frequency spectra for all non-synonymous and non-coding 
annotation categories, divided by the best neutral unlinked model on 
synonymous sites (see Figure 3d]. The upper plots show data and model fit for 
the highest recombination bin in all annotation categories. The lower plot shows 
data and fits for the lowest recombination bin, for which rare variants are 
relatively more frequent than common variants, due to hitchhiking, as correctly 
captured by our linked adaptation mixed model. Data points are averaged over 
multiple counts for plotting purpose only (Methods]. 
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Figure 5: Sequence divergence statistics 

This figure shows how fixed differences between D. melanog aster and D. 
simulans are distributed in different sequence annotation classes and for 
different recombination rates . (a] Overall numbers of substitutions. In high- 
recombining regions (with recombination rates > 0.4 cM/Mb], most 
substitutions are caused by genetic drift or are adaptive, while hitchhiking is 
negligible. In low-recombining regions (with recombination rates < 0.4 cM/Mb], 
hitchhiking influences most substitutions. We also observe an increase of 
adaptive substitutions in intergenic regions (see text], (b] Fraction of adaptive 
substitutions in different annotation classes. In nonsynonymous, UTR, and 
proximal intergenic sequence, this fraction decreases with decreasing 
recombination rate (cf. Figure S5 b for other annotation classes], (c] Fitness flux. 
Total fitness flux (upper bound, see text] and negative (hitchhiking] component 
in coding sequence. These estimates indicate that the overall speed of adaptation 
decreases, while the cost of adaptation increases with decreasing recombination 
rate (see text]. For clarity, In (b] and (c] we show zero valued data points with a 
small positive offset. 
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